Import Libraries

Load and Review Data

• The dataset has 21 columns and 10127 observations

• All columns have 10127 observations, meaning that there are no null values, i.e. no columns with missing values.

• The isnull function confirmed this for us.

• Age has only 45 unique values, i.e. customer ages fall within a fairly narrow range.

• The majority of the numerical columns are continuous.

• Since all the values in the ID column are unique, we can drop it.
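
The checks above can be reproduced with a few pandas calls. The sketch below uses a tiny made-up frame in place of the real file; the column names (`CLIENTNUM` for the ID, `Customer_Age`, `Gender`) are assumptions for illustration and are not confirmed by the text.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
df = pd.DataFrame({
    "CLIENTNUM": [101, 102, 103, 104],   # hypothetical ID column
    "Customer_Age": [45, 49, 45, 51],
    "Gender": ["M", "F", "F", "M"],
})

n_rows, n_cols = df.shape                # observations, columns
missing_per_column = df.isnull().sum()   # count of nulls per column
unique_per_column = df.nunique()         # distinct values per column

# An identifier that is unique on every row carries no predictive signal.
if unique_per_column["CLIENTNUM"] == n_rows:
    df = df.drop(columns=["CLIENTNUM"])

print(n_rows, n_cols)
print(missing_per_column.to_dict())
print(list(df.columns))
```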

Statistical Summary

• A lot of the continuous variables look to have outliers, given there is a large difference between quartile 3 and the max data point.

• This includes credit limit, revolving balance, open to buy, total transaction amount and count.

• Further insight into this will be gained in the Exploratory Data Analysis (EDA).
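
A gap between quartile 3 and the maximum is only a hint. A common way to make it concrete is the boxplot rule, which flags points more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to made-up credit limits; the helper name `iqr_outlier_mask` is hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values):
    """Return a boolean mask marking points outside the 1.5 * IQR whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (values < lower) | (values > upper)

# Made-up credit limits: most are small, one is far above the third quartile.
credit_limit = [1500, 2000, 2500, 3000, 3500, 4000, 34000]
mask = iqr_outlier_mask(credit_limit)
print(mask.sum())  # number of flagged points
```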

Checking for any Duplicate Data

• There do not appear to be any duplicate rows in the dataset.

Fixing Data Types

• Various columns are of type object. We can change them to categories.

• Converting "objects" to "category" reduces the data space required to store the dataframe

Given that Attrition_Flag is the target variable and is categorical, we will encode its two classes as 1 and 0. This will also allow us to compare variables more easily in the EDA.

• We can see that the 'object' data types have now been converted to 'category'

• Attrition_Flag just has 2 classes; 1 (Attrited Customer) and 0 (Existing Customer)

• We can see that the memory usage has decreased from 1.6+MB to 1.1MB.
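
A minimal sketch of the two steps above, on a made-up frame. The column names follow the text where it gives them (`Attrition_Flag`); `Gender` and `Card_Category` are assumed names, and the memory figures it prints will not match the notebook's.

```python
import pandas as pd

# Toy frame; values are made up for illustration.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer"] * 500,
    "Gender": ["M", "F", "F", "M"] * 250,
    "Card_Category": ["Blue", "Blue", "Silver", "Blue"] * 250,
})

memory_before = df.memory_usage(deep=True).sum()

# Encode the target: 1 = attrited (the class of interest), 0 = existing.
df["Attrition_Flag"] = df["Attrition_Flag"].map(
    {"Attrited Customer": 1, "Existing Customer": 0}
)

# Convert the remaining text columns to the more compact category dtype.
for col in ["Gender", "Card_Category"]:
    df[col] = df[col].astype("category")

memory_after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(memory_before, memory_after)
```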

Exploratory Data Analysis

Univariate analysis

Observation on Age

• The distribution of customer age from the histogram looks relatively normal, with the mean and median closely aligned.

• However, the boxplot shows 2 outliers at the right end, with a maximum of 73. This leads to a very slight right skew.

• We will not treat these outliers, as they represent the real market trend of age among the bank's customers.

Observation on Months on the Book

• The distribution for months on the book has an extreme peak at the mean/median value, both of which are around 36.

• This indicates that the majority of customers have about 3 years (i.e. 36 months) on the books.

• The box plot shows outliers beyond both the left and right whiskers, so the distribution has tails on both sides.

Observation on Credit Limit

• The distribution of credit limit has an extreme right skew. As a result, there is a large gap between the mean and median.

• There are outliers, as displayed by the boxplot. A large portion of these outliers are in the range of 34K - 35K, as displayed by the histogram.

• Given there are a lot of customers in this range, we will not treat these outliers, as they represent the real market trend of credit limit with respect to the bank.

Observation on Total Revolving Balance

• There does not appear to be outliers among revolving balance. The dispersion looks to fluctuate quite a bit.

• There is a large portion of customers that do not appear to have a revolving balance as shown by the histogram.

• Of those that do have a revolving balance, the highest portion of customers have a balance of roughly 2500.

Observation on Average Open to Buy

• Average open to buy is the difference between the credit limit and the revolving balance. Given the high portion of outliers among both of these variables, we would expect high outliers here. This is clear per the box plot.

• As a result, there is a strong right skew for average open to buy.

• Given that this variable is derived from 2 other variables, and we will not be treating the outliers for those 2 variables, we will not treat the outliers of average open to buy either.

Observation on Total Transaction Amount

• The dispersion of total transaction amount also has a large right skew

• The boxplot shows various outliers beyond the right whisker, which leads to this large skew.

• We will not treat these outliers as they represent the real market trend

Observation on Total Transaction Count

• The dispersion of the count of transactions over the last 12 months looks to be more normally distributed than some of the other independent variables.

• There are just 2 outliers, as can be seen from the boxplot.

• We will not treat these outliers as they represent the real market trend

Observation on Average Utilization Ratios

• Average utilization ratio is the ratio of your revolving balance to your overall balance available, i.e. Average Open to Buy.

• Given the strong right skew of average open to buy, we would expect a similar trend here. This can be seen from both plots.

• There do not appear to be any noticeable outliers.

Observations on Attrition Flag

• The class distribution in the target variable is imbalanced.

• We have 84% (rounded) of observations for existing customers and 16% (rounded) for attrited customers. This was also clear earlier when we looked at the shape of the data. However, a visual often gives a good representation of just how large that difference is.
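
The class shares quoted above come from a normalized frequency count. A minimal sketch on a made-up target with the same rounded proportions (1 = attrited, 0 = existing):

```python
import pandas as pd

# Made-up target with roughly the imbalance described in the text.
target = pd.Series([0] * 84 + [1] * 16, name="Attrition_Flag")

# normalize=True returns shares instead of raw counts.
class_share = target.value_counts(normalize=True).sort_index()
print(class_share)
```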

Observations on Gender

• There are more female than male credit card customers, but not by much.

• Approximately 53% of customers are female and 47% are male.

Observations on Dependent Count

• The largest group of customers has 3 dependents (27%), followed closely by 2 dependents (26.2%).

• Roughly 9% of customers do not have any dependents and 4% of customers have 5 dependents.

Observations on Education Level

• The largest group of customers (roughly 31%) has a graduate degree.

• This is followed by high school level at roughly 20% (rounded).

• Relatively high portions of the customers are either uneducated (14.7%) or have an unknown level of education (15%).

Observations on Marital Status

• The majority of the customers are either married (46.3%) or single (38.9%).

• This is considerably higher than the other classes: 7.4% of customers are divorced and for a further 7.4% the marital status is unknown.

Observations on Income Category

• The largest group of customers (35%, rounded) earns less than $40K.

• 7% (rounded) of customers earn in excess of $120K. This would be expected, as it is considerably greater than the average wage.

• There is little variance between some of the other classes. However, the income level is unknown for 11% of customers.

Observations on Card Category

• Unsurprisingly, the blue credit card is the dominant card among customers (roughly 93%). This would be expected, as it is the more commonly known, basic and affordable card offered by banks.

• Silver makes up 5.5% of the customers' credit cards, and gold and platinum together make up just 1.3%. Again, this would be expected, given they are more expensive and generally carry higher interest rates.

Observations on Relationship Count

• Roughly 23% (rounded) of customers have 3 relationships with the bank. Examples may include a debit, savings and credit account.

• There is consistency among customers with 4, 5 and 6 relationships, each group consisting of 18-19% of the customers. They account for a combined total of roughly 57% of the bank's customers.

Observations on Months Inactive

• 3 months of inactivity accounts for the largest number of customers, at 38%. This is followed closely by customers with 2 months of inactivity (32.4% of customers).

• 0 months of inactivity accounts for little to no customers. This implies that more or less all customers have had at least 1 month of inactivity.

Observations on Contacts Count 12 Month

• 2 and 3 contacts with the bank account for the largest portion of the customers. They account for 31.9% and 33.4% of customers.

Bivariate Analysis

Insights:

• A positive linear relationship is evident between credit limit and average open to buy.

• Positive correlation is also evident between other variables, such as age and months on the books. We will discuss some of these relationships further towards the end of the EDA.

Attrition_Flag vs Customer_Age, Credit_Limit, Total_Trans_Amt, Months_on_book

Insights

Attrition Flag vs Gender

• Earlier we saw there are more females than males. This shows that female customers are slightly more likely to renounce their account than males, although the difference is minor.
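
Comparisons like this one, and those in the sections that follow, amount to an attrition rate per category. One way to compute it, shown here on made-up records, is a row-normalized cross-tabulation; the numbers are illustrative and are not the notebook's results.

```python
import pandas as pd

# Made-up records; 1 = attrited, 0 = existing.
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"],
    "Attrition_Flag": [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
})

# normalize="index" turns each row into shares that sum to 1,
# so the column for class 1 is the attrition rate within each group.
rates = pd.crosstab(df["Gender"], df["Attrition_Flag"], normalize="index")
print(rates)
```

Swapping `Gender` for any other categorical column gives the matching table for that variable.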

Attrition Flag vs Marital Status

• There is no significant difference with respect to marital status.

• However, singles and unknowns look to have slightly more attrited customers than divorced and married customers.

Attrition Flag vs Education Level

• There is no significant difference with respect to the education level of customers, other than for doctorates.

• Surprisingly, customers with doctorates appear to be the most likely to renounce their credit card.

Attrition Flag vs Income Category

• There is no significant difference across the different levels of income.

• However, customers that earn less than $40K look most likely to renounce their credit cards.

Attrition Flag vs Card Category

• Platinum card holders look most likely to renounce their credit cards, followed by gold card holders.

Attrition Flag vs Total Relationship Count

• Customers with 2 bank relationships look most likely to renounce their credit card, followed by customers with 1 relationship. A relationship could be taken as the number of accounts or loans with the bank.

• Customers with more relationships appear less likely to renounce. This would imply these customers are more loyal customers for the bank.

Attrition Flag vs Contacts Count

• There looks to be a positive correlation between the number of contacts with the bank and the likelihood of a customer renouncing their credit card. Customers that have been contacted 6 times are all attrited customers.

• This would imply that the longer the customer has not used their credit card, the more likely it is for the bank to contact them.

Attrition Flag vs Months Inactive 12 Month

• There looks to be a positive correlation between the months of inactivity and the likelihood of a customer renouncing their credit card. This is consistent from 1-4 months of inactivity, but we then notice a decreasing trend.

• This implies that some customers begin to reuse their credit card after 4 months of inactivity.

Heatmap of Correlation

Insights

From the heatmap we can see that there is strong correlation between some of the independent variables. This was expected, given that some of these variables are derived from one another from the outset. For example:

  1. Average Utilization Ratio - this is the ratio of your revolving balance to your overall balance available, i.e. Average Open to Buy. Given this, there is a strong positive correlation between this derived variable and the Total Revolving Balance (as the revolving balance increases for a customer, the utilization ratio will increase). Conversely, there is a negative correlation between this variable and the average open to buy.

  2. Total Transaction Amount - this displays high correlation with the total transaction count. We also saw this from the boxplots earlier. This would be expected: when a customer makes more transactions, we would expect them to spend more, i.e. the transaction amount will increase. Similarly, the Q4-Q1 comparisons show a positive correlation, but it is not as strong.

  3. Months on Book - strong positive correlation with age. Again this would be expected, as the older someone gets, the more probable it is that they have spent more time with the bank.

  4. Credit Limit - shows 100% positive correlation with average open to buy. That is because when the revolving balance is zero, these metrics are equivalent. Given this, it is also going to have a negative correlation with the utilization ratio, as we saw for average open to buy.
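
To see why a derived variable correlates so strongly with its source, the sketch below builds open to buy as credit limit minus revolving balance (as defined earlier in the text) on synthetic data and lists the variable pairs by absolute correlation. The column names and value ranges are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data; names mimic the dataset's style but are assumptions here.
credit_limit = rng.uniform(1500, 35000, n)
revolving_bal = rng.uniform(0, 1400, n)
df = pd.DataFrame({
    "Credit_Limit": credit_limit,
    "Total_Revolving_Bal": revolving_bal,
    # Open to buy as defined in the text: limit minus revolving balance.
    "Avg_Open_To_Buy": credit_limit - revolving_bal,
})

corr = df.corr()

# Keep the upper triangle so each pair appears once, then rank by |r|.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Because the revolving balance is small relative to the credit limit, the first pair listed is credit limit against open to buy, with a correlation very close to 1.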

Data Pre-Processing

Missing Value Treatment

• From the EDA there are some variables with a relatively high percentage of values that are 'Unknown' classes.

• Given this relatively high percentage, I will treat these as null values. I will convert these values to NaN and then use the KNN imputer to impute the missing values.

• Also, I am choosing not to treat outliers for this project, as the outliers in this dataset represent the real market trend for the bank. However, it is worth noting that ideally it would be good to have model comparisons. If we had the time, we could build the models with outliers and compare them to models where we treat the outlier values. This is a biased decision, but I am happy to proceed for the purpose of this project.

• Educational level, marital status and income category all have null values now that we have transformed the unknowns.

Values have been encoded.

Given the high correlation between some of the independent variables, we will look to drop these variables. Highly correlated independent variables (multicollinearity) add redundant information and can make model estimates unstable.

Split the data into train and test sets
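
A minimal sketch of a stratified split on synthetic labels with the same 16% positive share. The `test_size` and `random_state` values are illustrative choices, not taken from the notebook; `stratify=y` keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 16% positives, mirroring the text.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 160 + [0] * 840)

# test_size and random_state are illustrative, not the notebook's values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
print(y_train.mean(), y_test.mean())
```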

Imputing Missing Values

• All missing values have been treated.

• Let's inverse map the encoded values.

• Checking inverse mapped values/categories.
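
The full round trip described in this section (mark 'Unknown' as missing, encode, impute, inverse map) can be sketched as follows. The data, the code mapping and `n_neighbors=2` are all made up for illustration. On real data the imputer would be fitted on the training split only and then reused to transform the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data; the category codes below are illustrative, not the notebook's mapping.
df = pd.DataFrame({
    "Customer_Age": [30, 31, 32, 58, 59, 60],
    "Marital_Status": ["Single", "Single", "Unknown",
                       "Married", "Unknown", "Married"],
})

# 1. Treat 'Unknown' as missing.
df["Marital_Status"] = df["Marital_Status"].replace("Unknown", np.nan)

# 2. Encode categories as numbers so distances can be computed.
codes = {"Single": 0, "Married": 1}
df["Marital_Status"] = df["Marital_Status"].map(codes)

# 3. Impute each gap from the nearest rows that have a value.
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(df)
df["Marital_Status"] = np.round(imputed[:, 1]).astype(int)

# 4. Map the codes back to readable labels.
inverse_codes = {code: label for label, code in codes.items()}
df["Marital_Status"] = df["Marital_Status"].map(inverse_codes)
print(df)
```

Rounding after imputation matters because the imputer averages neighbour codes, which can produce values between two categories.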

Before building the model, let's create functions to calculate different metrics (accuracy, recall and precision) and to plot the confusion matrix. Functions make sense in this case, as we will be creating various models throughout this project.
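
A minimal sketch of such a helper, using scikit-learn's metric functions. The function name `classification_scores` and the labels are hypothetical; the notebook's own helpers may differ, and plotting is left out here.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score,
)

def classification_scores(y_true, y_pred):
    """Return accuracy, recall and precision for binary labels (1 = attrited)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
    }

# Made-up labels: 4 attrited customers, of which the model catches 3.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

scores = classification_scores(y_true, y_pred)
# Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(scores)
print(cm)
```

Here recall is 3 / (3 + 1) = 0.75: one attrited customer was missed, which is the false negative the bank most wants to avoid.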

Model evaluation criterion

False Negatives: Reality: A customer renounced their credit card. Model predicted: A customer did NOT renounce their credit card account. Outcome: a loss of income for the bank as a result of the customer renouncing their account.

For this project, our goal is to predict which customers will renounce their credit card account and to provide insights and recommendations for the bank. In this case, failing to identify a customer who will renounce their credit card is the biggest loss to the bank. Minimizing this loss will save the bank money, as it can anticipate renouncers ahead of time and improve the specific areas that may prevent attrition. Hence, recall is the right metric for checking model performance. The bank will want recall to be maximized, i.e. we need to reduce the number of false negatives. Recall is the ratio of true positives to actual positives, so high recall implies few false negatives.

Given the large class imbalance in the data, i.e. the percentage of existing customers is significantly greater than that of attrited customers, models are going to be biased towards the dominant class (existing customers). This implies that it will be difficult from the outset for models to predict renouncers (our overall goal). We will need to use sampling methods as well as hyperparameter tuning to try to improve the performance of the models.

Logistic Regression

Let's evaluate the model performance by using KFold and cross_val_score

The K-Folds cross-validator provides train/validation indices to split the data. It splits the dataset into k consecutive folds (without shuffling by default); the stratified variant, StratifiedKFold, additionally preserves the class proportions in each fold. Each fold is then used once as validation while the k - 1 remaining folds form the training set.
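
A minimal sketch of cross-validated recall for a logistic regression. The text names KFold but describes stratified folds, so this sketch uses `StratifiedKFold`, which is an assumption; the data is synthetic and `n_splits`, `shuffle` and `random_state` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 16% positives) standing in for the training set.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.84], random_state=1
)

# n_splits, shuffle and random_state are illustrative choices.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# scoring="recall" evaluates recall for the positive class on each validation fold.
fold_recall = cross_val_score(model, X, y, scoring="recall", cv=cv)
print(fold_recall.min(), fold_recall.max())
```

The spread between the minimum and maximum fold score is the "varies between" range reported for each model.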

• Cross-validated recall on the training set varies between 0.20 and 0.345. This is extremely low.

• Let's check the performance on the test set.

• Logistic Regression has given a relatively poor performance on both the training and test sets.

• Recall is very low, and thus the model is relatively poor at predicting those that would renounce versus not renounce.

• We can use methods of upsampling and downsampling to see if we can improve the model.

SMOTE to upsample smaller class
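
SMOTE creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea in plain NumPy, not the library implementation; in practice the `SMOTE` class from imbalanced-learn would typically be used, and the helper name `smote_like_oversample` is hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = X_minority[rng.integers(len(X_minority))]
        # Distances from the base point to every minority point.
        dist = np.linalg.norm(X_minority - base, axis=1)
        # Skip position 0 of the sort, which is the base point itself.
        neighbour_idx = np.argsort(dist)[1:k + 1]
        neighbour = X_minority[rng.choice(neighbour_idx)]
        # The new point lies on the segment between base and neighbour.
        synthetic[i] = base + rng.random() * (neighbour - base)
    return synthetic

# Made-up minority-class rows with two features.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_like_oversample(X_min, n_new=6)
print(X_new.shape)
```

As with any resampling, this should only ever be applied to the training data, so that the test set keeps the real class balance.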

Logistic Regression on UpSampled data

Let's evaluate the model performance by using KFold and cross_val_score

• Cross-validated performance of the model on the training set varies between 0.7385 and 0.7885, which is an improvement on the initial model (without upsampling).

• Let's check the performance on the test set.

Regularization

Insights

• Upsampling (which adds synthetic data to the smaller class) has greatly improved the performance of the logistic regression. The recall has increased considerably.

• The model is now better at predicting renouncers versus non-renouncers. The confusion matrix percentages provide clear evidence of this, given the reduced percentage of false negatives (bottom left section).

• However, it is worth noting that there is now some evidence of overfitting, based on the gap between the training and test recall.

• With regularization we are able to reduce this overfitting, but the recall performance is reduced.
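
For a scikit-learn logistic regression, regularization strength is controlled through `C`, the inverse of the penalty strength. The sketch below, on synthetic data with illustrative `C` values, shows the effect on the size of the coefficients; the settings used in the notebook are not stated in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# C is the inverse of the regularization strength:
# a smaller C penalizes large coefficients more heavily.
weak_penalty = LogisticRegression(C=100.0, max_iter=2000).fit(X, y)
strong_penalty = LogisticRegression(C=0.01, max_iter=2000).fit(X, y)

weak_norm = np.linalg.norm(weak_penalty.coef_)
strong_norm = np.linalg.norm(strong_penalty.coef_)
print(weak_norm, strong_norm)
```

Smaller coefficients make the model less sensitive to individual features, which is why the train/test gap narrows at some cost in recall.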

Down Sampling the larger class

• We will now compare upsampling the smaller class (adding synthetic data) with down sampling the larger class. This is another approach that can be taken to reduce the imbalance within a dataset.

• There is significant bias for the existing customer class, as has been discussed throughout this project. We will down sample this class using random
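
One common way to down sample is to randomly drop rows of the larger class until both classes are the same size. The sketch below does this in plain NumPy on made-up data; the helper name `random_undersample` is hypothetical, and which tool the notebook actually used is not stated in the text.

```python
import numpy as np

def random_undersample(X, y, majority_label=0, seed=0):
    """Randomly drop majority rows until both classes have the same size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    # Sample without replacement so no majority row is kept twice.
    kept_majority = rng.choice(
        majority_idx, size=len(minority_idx), replace=False
    )
    keep = np.sort(np.concatenate([kept_majority, minority_idx]))
    return X[keep], y[keep]

# Made-up data: 84 existing customers (0) and 16 attrited customers (1).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 84 + [1] * 16)
X_under, y_under = random_undersample(X, y)
print(np.bincount(y_under))
```

The trade-off against upsampling is that information in the discarded majority rows is lost.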

Logistic Regression on undersampled data

Let's again evaluate the model performance by using KFold and cross_val_score

• Cross-validated performance of the model on the training set varies between 0.71825 and 0.75575, which is an improvement on the initial model (without any resampling).

• Let's check the performance on the test set.

Insights

• Model performance has improved using downsampling - referring to the confusion matrix the logistic regression is now better at differentiating between positive and negative classes.

• The recall has again increased substantially and there is less evidence of overfitting given there is little difference between the recall training and test results.

• However, it is worth noting that there is evidence of overfitting for precision. Regularization would be a step we could take to cap certain coefficient values, but since precision is not our metric of concern and there is no strong evidence of overfitting on recall, we will not proceed with regularization here.

Model building - Bagging and Boosting

Models to HyperTune

• We can see that XGBoost is giving the highest cross-validated recall, followed by gradient boosting and decision tree. For XGB there are outliers both above and below.

• In an ideal scenario I would proceed with the 3 highest recall scores. However, given the computational time for gradient boosting, I have chosen to proceed with XGBoost (XGB), AdaBoost (ADB) and the decision tree (DTREE) as the models of choice. This will be consistent for both GridSearch and Randomized Search.

• Given the good performance of these 3 models in terms of the cross-validated recall score, we will select these models to hypertune.
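
A comparison like the one above can be produced by looping over candidate models and scoring each with cross-validated recall. The sketch below is one way to do it on synthetic data. The candidate list is an assumption, and XGBoost is left out only to avoid the extra dependency, so the ranking it prints says nothing about the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.84], random_state=1
)

# Illustrative candidates; the notebook's full list is not given in the text.
models = {
    "dtree": DecisionTreeClassifier(random_state=1),
    "adaboost": AdaBoostClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
mean_recall = {
    name: cross_val_score(model, X, y, scoring="recall", cv=cv).mean()
    for name, model in models.items()
}

# Rank models by mean cross-validated recall, best first.
for name, score in sorted(mean_recall.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```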

We will use pipelines with StandardScaler and each model individually, to tune the model using GridSearchCV and RandomizedSearchCV. We will also compare the performance and time taken by these two methods - grid search and randomized search.

We can also use make_pipeline function instead of Pipeline to create a pipeline.

make_pipeline: This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
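
A minimal sketch of both search strategies on a `make_pipeline` pipeline, using a decision tree on synthetic data. The parameter ranges, `cv=3` and `n_iter=5` are illustrative choices, not the notebook's settings. Note how the auto-generated step name is used as a prefix when addressing a parameter.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.84], random_state=1
)

pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
# Step names are generated from the lowercased class names.
step_names = [name for name, _ in pipe.steps]

# Parameters are addressed as <step name>__<parameter>.
grid = GridSearchCV(
    pipe,
    param_grid={
        "decisiontreeclassifier__max_depth": [2, 4, 6],
        "decisiontreeclassifier__min_samples_leaf": [1, 5, 10],
    },
    scoring="recall",
    cv=3,
).fit(X, y)

# Randomized search samples n_iter settings instead of trying every combination.
random_search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "decisiontreeclassifier__max_depth": randint(2, 8),
        "decisiontreeclassifier__min_samples_leaf": randint(1, 11),
    },
    n_iter=5,
    scoring="recall",
    cv=3,
    random_state=1,
).fit(X, y)

print(step_names)
print(grid.best_params_, grid.best_score_)
print(random_search.best_params_, random_search.best_score_)
```

The grid search fits all 9 combinations while the randomized search fits only 5 sampled ones, which is where the time difference between the two methods comes from.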

First, let's create two functions to calculate different metrics and confusion matrix, so that we don't have to use the same code repeatedly for each model.

AdaBoost

GridSearchCV

XGBoost

GridSearchCV

Decision Tree

RandomizedSearchCV

AdaBoost

XGBoost

Decision Tree

Model Performances

The models of choice were AdaBoost, XGBoost and the Decision Tree. We have computed each of these models and reviewed the performance of these models. Our main metric of concern is recall. We are looking for the highest recall values, which minimizes any overfitting and minimizes the percentage of false negatives. Hypertuning these models using both GridSearch & Randomized search through pipelines has also been completed, which has greatly improved the performance of these models in terms of the recall. I will provide a summary table to compare these results.

Model Performances

• For AdaBoost the performance is the same from both Grid Search and Randomized Search. The decision tree has better recall performance using GridSearch, but the difference is minimal.

• The model of best performance is XGBoost through randomized search, with a test recall of 0.93. There is also no evidence of overfitting. The performance is still very good using GridSearch.

• It is also worth noting that the time taken for randomized search was significantly less than that of grid search. This was particularly noticeable for XGBoost, which took 1h 12min 30s with GridSearch and just 1min 38s with randomized search. Randomized search was also faster for both AdaBoost and the decision tree classifier. Given that GridSearch is computationally exhaustive, randomized search is the better method to proceed with, as results tend to be as good if not better, and they are computed in significantly less time.

• Given that XGBoost was the model of best performance, we will take a look at the feature importance from the tuned XGBoost model. We will also take a look at the feature importance of the other models, to see if there is consistency amongst all 3.

Feature Importance - Tuned Decision Tree

Feature Importance - Tuned Ada

Feature Importance - Tuned XGBoost

Insights

• Total transaction amount is the main feature across all 3 models.

• For the XGBoost model (the best performing model) it is the dominant feature that contributes to predicting renouncers of credit cards, followed closely by total revolving balance and changes in transaction amounts between Q4 & Q1.

• Additional important features include months of inactivity, relationship count and months on the book.

• It is worth noting that a lot of features do not contribute to the prediction of renouncers versus non-renouncers in the model.
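
Tree-based scikit-learn models expose these rankings through the `feature_importances_` attribute after fitting. A minimal sketch on synthetic data follows; the feature names are placeholders rather than the dataset's columns, and the tree settings are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the feature names are placeholders, not the dataset's columns.
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 for a fitted tree.
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features the tree never splits on get an importance of exactly 0, which matches the observation that many features do not contribute.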

Actionable Insights & Recommendations

The model of best performance could be used to help identify customers that may renounce their credit card. Given that the main features within this model were transaction amount, revolving balance, changes in transaction amounts between Q4 and Q1, and relationship count, this can provide some good insight for the bank.

For example, if the bank is able to identify extreme reductions in transaction amounts from customers, it could contact those customers to inform them of any additional offers, or it could host events to promote certain new features. In the modern era a large portion of transactions are completed online. Given this, the bank should ensure that it has a user-friendly online banking system. Having a questionnaire for feedback may help the bank, as it could identify additional areas it needs to improve.

Additionally, the bank should ensure it has a benefits system in place to encourage customers to make transactions. For example, offering a customer 1% credit on each transaction can entice the customer to purchase more. Having additional agreements with restaurants, shops etc. can help promote additional transactions, e.g. offering 1.5-2% for certain supermarkets.

Revolving balance is the credit balance carried over. This is a strong feature in terms of predicting renouncers of credit cards. It is bad for customers to increase their revolving balance. Generally this will be expensive as interest rates will increase, which may contribute to customers renouncing their cards. If the bank can identify substantial increases to a customer's revolving balance early, it can attempt to contact the customer prior to increasing interest rates.

Given that relationship count is also an important feature, the bank also needs to retain its customers with various accounts/loans. Regular communication and offers to these customers can help keep them satisfied and improve their longevity. Also, we verified from the EDA that there was a positive correlation between attrited customers and the number of contacts, i.e. the bank will contact attrited customers ahead of existing customers. However, if the bank can contact these customers earlier, it may help to prevent existing customers from becoming attrited customers. The bank should look to become more proactive by contacting its current customers if they show signs of inactivity, and also just in general to ensure that they are satisfied. A satisfaction survey via email or text message may assist with this.